POST /api/v1/marker
[DEPRECATED] Marker
import requests

url = "https://www.datalab.to/api/v1/marker"
headers = {"X-API-Key": "<api-key>"}

# Form fields are sent as strings. The placeholders below list every accepted
# field; send only the ones you need.
payload = {
    "file_url": "<string>",
    "mode": "fast",
    "max_pages": "123",
    "page_range": "<string>",
    "langs": "<string>",
    "force_ocr": "false",
    "format_lines": "false",
    "paginate": "false",
    "add_block_ids": "false",
    "include_markdown_in_chunks": "false",
    "strip_existing_ocr": "false",
    "disable_image_extraction": "false",
    "disable_image_captions": "false",
    "fence_synthetic_captions": "false",
    "disable_ocr_math": "false",
    "use_llm": "false",
    "output_format": "<string>",
    "token_efficient_markdown": "false",
    "skip_cache": "false",
    "save_checkpoint": "false",
    "block_correction_prompt": "<string>",
    "page_schema": "<string>",
    "segmentation_schema": "<string>",
    "additional_config": "<string>",
    "workflowstepdata_id": "123",
    "extras": "<string>",
    "webhook_url": "<string>",
    "pipeline_id": "<string>",
    "run_eval": "false",
}

# Upload the document under the "file" form field (see Body below) and close
# the file handle once the request has been sent.
with open("example-file", "rb") as f:
    files = {"file": ("example-file", f)}
    response = requests.post(url, data=payload, files=files, headers=headers)

print(response.text)
{
  "request_id": "<string>",
  "request_check_url": "<string>",
  "success": true,
  "error": "<string>",
  "versions": {}
}

Authorizations

X-API-Key
string
header
required

Cookies

wos-session
string
access_token
string
datalab_active_team
string

Body

multipart/form-data
file_url
string | null

Optional file URL (http/https). If provided, the server will download and process it.

mode
string
default:fast

Which output mode to use. Valid values: 'fast' (lowest latency, great for real-time use cases), 'balanced' (balanced accuracy and latency, works well with most documents), 'accurate' (highest accuracy and latency, good on the most complex documents).

max_pages
integer | null

The maximum number of pages in the PDF to convert.

page_range
string | null

The page range to parse, comma separated like 0,5-10,20. This will override max_pages if provided. Example: '0,2-4' will process pages 0, 2, 3, and 4.
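To check a page_range string before sending it, the format described above can be expanded on the client. This is a minimal sketch; `expand_page_range` is a hypothetical helper, not part of the API, and the server may treat malformed input differently.

```python
def expand_page_range(page_range):
    """Expand a string such as '0,2-4' into a sorted list of page indices."""
    pages = set()
    for part in page_range.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if end < start:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            pages.add(int(part))
    return sorted(pages)


print(expand_page_range("0,2-4"))  # [0, 2, 3, 4]
```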

langs
string | null

Note: This parameter is deprecated and is ignored in the current version. It previously set the languages to use if OCR is needed, comma separated. Values must be either the names or codes from https://github.com/datalab-to/surya/blob/master/surya/languages.py. Any other inputs will be ignored.

force_ocr
boolean
default:false
deprecated

[DEPRECATED] This parameter is deprecated and has no effect. OCR is handled automatically by the parsing pipeline.

format_lines
boolean
default:false
deprecated

[DEPRECATED] This parameter is deprecated and has no effect. Line formatting is handled automatically by the parsing pipeline.

paginate
boolean
default:false

Whether to paginate the output. If set to True, each page of the output is separated by a horizontal rule that contains the page number (2 newlines, {PAGE_NUMBER}, 48 '-' characters, 2 newlines). Defaults to False.
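Paginated markdown can be split back into pages by matching that separator. The sketch below is an assumption-laden example: it is unclear from this page whether the braces around the page number are literal, so the pattern accepts both forms, and `split_pages` is a hypothetical helper name.

```python
import re

# Two newlines, the page number (braces optional), 48 '-' characters, two newlines.
# ASSUMPTION: the exact separator layout is inferred from the description above.
PAGE_BREAK = re.compile(r"\n\n\{?(\d+)\}?-{48}\n\n")


def split_pages(markdown):
    """Return {page_number: page_text} for markdown produced with paginate=true."""
    parts = PAGE_BREAK.split(markdown)
    # With one capture group, split() yields:
    # [text_before_first_separator, number, text, number, text, ...]
    return {int(number): text for number, text in zip(parts[1::2], parts[2::2])}
```

Any text that appears before the first separator is ignored by this helper.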

add_block_ids
boolean
default:false

Add data-block-id attributes to HTML elements for citation tracking. Only applies when output_format includes 'html'.

include_markdown_in_chunks
boolean
default:false

Include markdown field in chunks and JSON output. When enabled, each chunk will have a 'markdown' field with the markdown representation of that block. Only applies when output_format includes 'json' or 'chunks'.

strip_existing_ocr
boolean
default:false
deprecated

[DEPRECATED] This parameter is deprecated and has no effect. OCR handling is managed automatically by the parsing pipeline.

disable_image_extraction
boolean
default:false

Disable image extraction from the PDF. If use_llm is also set, then images will be automatically captioned. Defaults to False.

disable_image_captions
boolean
default:false

Disable synthetic image captions/descriptions in output. Images will be rendered as plain img tags without alt text or the img-description wrapper div. Defaults to False.

fence_synthetic_captions
boolean
default:false

Wrap synthetic image captions in markdown with HTML comment markers (<!-- ... -->) for easy identification/removal. Only applies to markdown output.

disable_ocr_math
boolean
default:false
deprecated

[DEPRECATED] This parameter is deprecated and has no effect. Math recognition is handled automatically by the parsing pipeline.

use_llm
boolean
default:false
deprecated

[DEPRECATED] This parameter is deprecated. Use the 'mode' parameter instead: 'balanced' or 'accurate' modes.

output_format
string | null

The output format for the text. Can be 'json', 'html', 'markdown', or 'chunks'. Defaults to 'markdown'. To request multiple formats, separate them with commas, like markdown,html.
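A small client-side check can catch typos in output_format before a request is submitted. This is only a sketch; `build_output_format` is a hypothetical helper, and the list of valid values is copied from the description above.

```python
VALID_FORMATS = ("json", "html", "markdown", "chunks")


def build_output_format(formats):
    """Validate format names and join them into a comma-separated string."""
    selected = []
    for name in formats:
        name = name.strip().lower()
        if name not in VALID_FORMATS:
            raise ValueError(f"unsupported output format: {name!r}")
        if name not in selected:  # drop duplicates, keep first-seen order
            selected.append(name)
    if not selected:
        raise ValueError("at least one output format is required")
    return ",".join(selected)


print(build_output_format(["markdown", "html"]))  # markdown,html
```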

token_efficient_markdown
boolean
default:false

When enabled, the markdown output uses token-efficient formatting optimized for LLMs (compact tables with single-dash headers, single-space list indents).

skip_cache
boolean
default:false

Skip the cache and re-run the inference. Defaults to False.

save_checkpoint
boolean
default:false

Save the checkpoint after processing. Defaults to False. This is only useful if you're applying custom rules iteratively.

block_correction_prompt
string | null
deprecated

[DEPRECATED] This parameter is deprecated and has no effect. Block correction is not currently supported.
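Existing clients that still send the deprecated parameters listed on this page (langs, force_ocr, format_lines, strip_existing_ocr, disable_ocr_math, use_llm, block_correction_prompt) can drop them before submitting. The sketch below is one possible migration. `strip_deprecated` is a hypothetical helper, and mapping use_llm=true to 'balanced' is an assumption, since the description of use_llm only says to use 'balanced' or 'accurate'.

```python
DEPRECATED_PARAMS = (
    "langs",
    "force_ocr",
    "format_lines",
    "strip_existing_ocr",
    "disable_ocr_math",
    "use_llm",
    "block_correction_prompt",
)


def strip_deprecated(payload, llm_mode="balanced"):
    """Return a copy of payload without deprecated fields.

    If use_llm was truthy and no mode was given, set mode to llm_mode.
    ASSUMPTION: 'balanced' as the replacement is a choice made here;
    pass llm_mode='accurate' to prefer the other documented option.
    """
    cleaned = {key: value for key, value in payload.items()
               if key not in DEPRECATED_PARAMS}
    used_llm = str(payload.get("use_llm", "false")).lower() == "true"
    if used_llm and "mode" not in cleaned:
        cleaned["mode"] = llm_mode
    return cleaned
```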

page_schema
string | null

The schema to use for structured extraction (only used with structured extraction endpoint). The ideal way to generate this is to create a Pydantic schema, then convert to JSON with .model_dump_json().

segmentation_schema
string | null

The schema to use for document segmentation. Should be a JSON string containing segment names and descriptions for identifying page ranges of different document sections.

additional_config
string | null

Additional configuration options as a JSON string. Only these keys have effect: 'keep_pageheader_in_output' (bool), 'keep_pagefooter_in_output' (bool), 'keep_spreadsheet_formatting' (bool).
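Because additional_config is passed as a JSON string, it may be easiest to build it from keyword arguments and serialize it with json.dumps. A minimal sketch, where `build_additional_config` is a hypothetical helper and the allowed keys are taken from the description above:

```python
import json

ALLOWED_KEYS = (
    "keep_pageheader_in_output",
    "keep_pagefooter_in_output",
    "keep_spreadsheet_formatting",
)


def build_additional_config(**options):
    """Serialize boolean options to the JSON string expected by additional_config."""
    for key, value in options.items():
        if key not in ALLOWED_KEYS:
            raise ValueError(f"key has no effect and is rejected here: {key!r}")
        if not isinstance(value, bool):
            raise TypeError(f"{key} must be a bool, got {type(value).__name__}")
    return json.dumps(options)


print(build_additional_config(keep_pageheader_in_output=True))
# {"keep_pageheader_in_output": true}
```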

workflowstepdata_id
integer | null

Optional workflow step data ID. If provided, this request will be associated with the specified workflow step execution.

extras
string | null

Comma-separated list of extra features to enable. Currently supports: 'track_changes', 'chart_understanding', 'table_row_bboxes', 'extract_links', 'infographic', 'new_block_types'.

webhook_url
string | null

Optional webhook URL to call when the request is complete. If provided, this will override the webhook URL stored in your account settings for this specific request.

pipeline_id
string | null

Optional custom pipeline ID. If provided, will execute the custom pipeline configuration associated with this ID.

run_eval
boolean
default:false

Internal: run evals over custom pipeline.

file
file | null

Input PDF, Word document, PowerPoint, or image file, uploaded as multipart form data. Images must be in PNG, JPG, or WebP format.

Response

Successful Response

request_id
string
required

The ID of the request. This ID can be used to check the status of the request.

request_check_url
string
required

The URL to check the status of the request and get results.
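Since results are retrieved from request_check_url, a client typically polls that URL until processing finishes. The sketch below uses only the standard library. The "status" field and its "complete" value are assumptions about the polled response, which this page does not document, and `fetch_json` and `wait_for_result` are hypothetical helper names.

```python
import json
import time
import urllib.request


def fetch_json(url, api_key):
    """GET a URL with the X-API-Key header and decode the JSON body."""
    request = urllib.request.Request(url, headers={"X-API-Key": api_key})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def wait_for_result(check_url, api_key, fetch=fetch_json, max_polls=300, delay=2.0):
    """Poll check_url until the job reports completion or the polls run out."""
    for _ in range(max_polls):
        result = fetch(check_url, api_key)
        # ASSUMPTION: the polled body has a "status" field that becomes
        # "complete" when processing is finished.
        if result.get("status") == "complete":
            return result
        time.sleep(delay)
    raise TimeoutError(f"no result after {max_polls} polls of {check_url}")
```

The `fetch` argument is injectable so the polling loop can be exercised without network access.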

success
boolean
default:true

Whether the request was successful.

error
string | null

If the request was not successful, this will contain an error message.
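The success and error fields can be checked before the request_id is used. A minimal sketch of that check on the decoded JSON body; `Submission` and `parse_submission` are hypothetical names, and treating a missing success field as true follows the documented default.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    request_id: str
    request_check_url: str


def parse_submission(body):
    """Turn the decoded submit response into a Submission or raise on failure."""
    if not body.get("success", True):  # success defaults to true
        raise RuntimeError(body.get("error") or "request failed without an error message")
    return Submission(body["request_id"], body["request_check_url"])
```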

versions

A dictionary of the versions of the libraries used in the request.